Abstract. This paper addresses the issue of policy evaluation in Markov
Decision Processes, using linear function approximation. It provides a
unified view of algorithms such as TD($\lambda$), LSTD($\lambda$), iLSTD, and
residual-gradient TD. It is asserted that they all consist in minimizing a
gradient function and differ by the form of this function and their means of
minimizing it. Two new schemes are introduced in that framework: Full-gradient
TD, which uses a generalization of the principle introduced in iLSTD, and
EGD TD, which reduces the gradient by successive equi-gradient descents.
These three algorithms form a new intermediate family with the interesting
property of making much better use of the samples than TD while keeping a
gradient descent scheme, which is useful for complexity issues and optimistic
policy iteration.

1 The policy evaluation problem 

A Markov Decision Process (MDP) describes a dynamical system and an agent.
The system is described by its state $s \in S$. When considering discrete
time, the agent can apply at each time step an action $u \in U$ which drives
the system to a state $s' = u(s)$ at the next time step; $u$ is generally
non-deterministic.

To each transition is associated a reward $r \in \mathcal{R} \subset
\mathbb{R}$. A policy $\pi$ is a function that associates to any state of the
system an action taken by the agent.

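Given a discount factor $\gamma$, the value function $v^\pi$ of a policy $\pi$
associates to any state the expected discounted sum of rewards received when
applying $\pi$ from that state for an infinite time:

$$v^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\ \pi\right]$$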

This paper addresses the evaluation of a policy by approximating the value 
function as a linear combination of fixed features, and estimating the coefficients 
from sampled trajectories (sequences of visited states and received rewards when 
starting from a certain state). 
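
As a minimal illustration of these two ingredients, the sketch below computes
the discounted sum of rewards along one finite sampled trajectory (a truncated
sample of the infinite sum) and evaluates a linear approximation of the value
function. The function names, the feature vectors and the numbers are
hypothetical and only serve the example.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ... of a finite reward list."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def linear_value(features, omega):
    """Approximate value of a state: dot product of its features with omega."""
    return sum(f * w for f, w in zip(features, omega))


rewards = [1.0, 0.0, 2.0]                      # hypothetical sampled rewards
print(discounted_return(rewards, 0.5))         # 1 + 0.5*0 + 0.25*2
print(linear_value([1.0, 0.5], [2.0, 4.0]))    # 1*2 + 0.5*4
```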




All the information on $v$ contained in a trajectory
$s_0 \xrightarrow{r_0} s_1 \xrightarrow{r_1} \cdots \xrightarrow{r_{n-1}} s_n$
lies in the following system of Bellman equations:

$$\begin{cases}
v(s_0) = r_0 + \gamma v(s_1) \\
\quad\vdots \\
v(s_{n-1}) = r_{n-1} + \gamma v(s_n)
\end{cases}$$



These equalities are an abuse of notation when the actions are not
deterministic, but averaging these equations converges to valid equations as
the number of samples tends to infinity.

The policy evaluation problem consists in finding a function that best
satisfies this system (which may include several trajectories). This can be
achieved in several ways. In the following, all major methods are described in
a single and simple framework:

• define a gradient function $\mu$ of the observed transitions and parameters;

• update its value whenever a new transition is observed;

• whenever needed, modify the parameters in order to reduce $\mu$, and then
update its value.

Section 2 discusses the two currently used gradient functions and their
meaning. Section 3 presents the TD algorithms, TD($\lambda$) [1] and
residual-gradient TD [2], in that framework. Section 4 shows that
LSTD($\lambda$) [3] and LSPE($\lambda$) [4] and their Bellman-residual
versions share the same kind of derivation. Section 5 discusses a third family
of algorithms that use an intermediate update scheme (full gradient). It
includes iLSTD [5], [6] and two algorithms introduced in this paper:
Full-gradient TD and equi-gradient descent TD. Section 6 presents experiments
made on the Boyan chain MDP, which illustrate some of the benefits and
drawbacks of each method. Finally, the conclusion discusses the potential
advantages of the full-gradient scheme for optimistic policy iteration.

Complete proofs of the equivalence of these formulations with the original
ones, and the derivation of the equi-gradient descent algorithm, are given in
[7], [8].

2 Fixed-point gradient vs. Bellman-residual gradient 

The TD(0) algorithm estimates $v$ iteratively by using its current estimate
$\hat v$ to approximate the right hand side of these equations:

$$v(s_t) = r_t + \gamma v(s_{t+1}) \;\Rightarrow\; v(s_t) \approx r_t + \gamma \hat v(s_{t+1})$$

$$\Rightarrow\; v(s_t) - \hat v(s_t) \approx r_t - \hat v(s_t) + \gamma \hat v(s_{t+1})$$

and consequently updating
$\hat v(s_t) \leftarrow \hat v(s_t) + \alpha\,(r_t - \hat v(s_t) + \gamma \hat v(s_{t+1}))$.
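
The update above can be sketched for a table of state values as follows. This
is only an illustration: the dictionary representation, the state names and
the numerical values are assumptions made for the example.

```python
def td0_update(v_hat, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to v_hat[s] in place; return the temporal difference."""
    d = r - v_hat[s] + gamma * v_hat[s_next]   # r_t - v(s_t) + gamma * v(s_{t+1})
    v_hat[s] += alpha * d
    return d


v_hat = {"a": 0.0, "b": 1.0}                   # hypothetical current estimates
d = td0_update(v_hat, "a", 1.0, "b", alpha=0.1, gamma=0.9)
print(d, v_hat["a"])                           # d is about 1.9, v_hat["a"] about 0.19
```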

TD($\lambda$) averages such approximations of $v(s_t)$ on all "dynamic
programming ranks". It can be seen as expanding the system to all implicit
equations:

$$v(s_0) = r_0 + \gamma v(s_1) = r_0 + \gamma(r_1 + \gamma v(s_2)) = \dots = r_0 + \gamma(r_1 + \gamma(r_2 + \dots + \gamma v(s_n)))$$



and again replacing $v$ by $\hat v$ in the right hand sides. The different
estimations of $v(s_t)$ are averaged using coefficients determined by a value
$\lambda \in [0, 1]$, which leads to estimating $v(s_t) - \hat v(s_t)$ by
$\sum_{\tau \geq t} (\gamma\lambda)^{\tau - t}\,(r_\tau - \hat v(s_\tau) + \gamma \hat v(s_{\tau+1}))$.
This error signal is again used to update $\hat v(s_t)$. In the case of linear
approximators, the vector of error signals on
$\hat v(s_0), \dots, \hat v(s_{t-1})$ can be written as $L(r - B\Phi\omega)$.
They are projected on the parameter $\omega$ of $\hat v$ by
$\Phi^T L(r - B\Phi\omega)$. This gives what one can call a fixed-point
gradient, which is the sum of these on all trajectories (i.e. the same
expression with adequately extended vectors and matrices).

Another approach is to aim at solving the Bellman system directly, i.e.
minimize $\|r - B\Phi\omega\|^2$ w.r.t. $\omega$. This gives the
Bellman-residual gradient $\Phi^T B^T (r - B\Phi\omega)$.

The conceptual difference is simple: the fixed-point gradient transforms
the errors on transitions (temporal differences) into errors on the
approximate value function itself (i.e. errors on single states) by a
multi-rank dynamic programming scheme, and then projects these estimated
errors on the parameter $\omega$, whereas the Bellman-residual gradient does a
direct projection.

The iterative computation of these gradients proceeds as follows: the
components of the vector $r - B\Phi\omega$ are the successive temporal
differences $d_t = r_t - \hat v(s_t) + \gamma \hat v(s_{t+1})$; the columns of
$\Phi^T L$ or $\Phi^T B^T$ are referred to as the eligibility traces $z_t$ in
the first case, and this denomination will be extended here to the second
case. Each new sampled transition modifies the gradient $\mu$ by
$\mu_t \leftarrow \mu_{t-1} + d_t z_t$, $z_t$ itself being computed
iteratively.
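
The iterative computation can be sketched as below. The text does not spell
out the recursion on $z_t$; the sketch assumes the usual accumulating trace
$z_t = \gamma\lambda z_{t-1} + \phi(s_t)$ for the fixed-point gradient and
$z_t = \phi(s_t) - \gamma\phi(s_{t+1})$ for the Bellman-residual gradient.
The function name and the representation of a transition as a triple
(features, reward, next features) are hypothetical.

```python
def accumulate_gradient(transitions, omega, gamma, lam, residual=False):
    """Return mu = sum_t d_t * z_t over one trajectory, for fixed parameters omega.

    transitions: list of (phi, r, phi_next), with phi and phi_next feature lists.
    residual=False: assumed fixed-point trace  z_t = gamma*lam*z_{t-1} + phi(s_t)
    residual=True : assumed residual "trace"   z_t = phi(s_t) - gamma*phi(s_{t+1})
    """
    k = len(omega)
    mu = [0.0] * k
    z = [0.0] * k
    for phi, r, phi_next in transitions:
        v = sum(f * w for f, w in zip(phi, omega))
        v_next = sum(f * w for f, w in zip(phi_next, omega))
        d = r - v + gamma * v_next                     # temporal difference d_t
        if residual:
            z = [f - gamma * fn for f, fn in zip(phi, phi_next)]
        else:
            z = [gamma * lam * zi + f for zi, f in zip(z, phi)]
        mu = [m + d * zi for m, zi in zip(mu, z)]      # mu_t <- mu_{t-1} + d_t z_t
    return mu


# Hypothetical 2-step trajectory with one-hot features and a terminal state
# represented by an all-zero feature vector.
trajectory = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
print(accumulate_gradient(trajectory, [0.0, 0.0], gamma=1.0, lam=1.0))
```

With $\gamma = \lambda = 1$ and $\omega = 0$, the fixed-point gradient of this
toy trajectory is simply the vector of observed returns from each state.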

These gradients, as well as $\hat v$, are linear in $\omega$:
$\mu = b - A\omega$, with $b = \Phi^T L r$ and $A = \Phi^T L B \Phi$ in the
fixed-point case, or $b = \Phi^T B^T r$ and $A = \Phi^T B^T B \Phi$ in the
Bellman-residual case.
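
This linearity can be checked numerically. The sketch below accumulates $A$
and $b$ for the fixed-point case, one transition at a time, and evaluates
$\mu = b - A\omega$. It assumes the accumulating trace
$z_t = \gamma\lambda z_{t-1} + \phi(s_t)$, so that each transition adds
$z_t(\phi(s_t) - \gamma\phi(s_{t+1}))^T$ to $A$ and $z_t r_t$ to $b$; the
function names and the toy trajectory are hypothetical.

```python
def accumulate_A_b(transitions, gamma, lam):
    """Fixed-point case: A = sum_t z_t (phi_t - gamma*phi_{t+1})^T, b = sum_t z_t r_t."""
    k = len(transitions[0][0])
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    z = [0.0] * k
    for phi, r, phi_next in transitions:
        z = [gamma * lam * zi + f for zi, f in zip(z, phi)]     # assumed trace update
        diff = [f - gamma * fn for f, fn in zip(phi, phi_next)]
        for i in range(k):
            b[i] += z[i] * r
            for j in range(k):
                A[i][j] += z[i] * diff[j]
    return A, b


def gradient(A, b, omega):
    """Evaluate mu = b - A omega."""
    return [bi - sum(a * w for a, w in zip(row, omega)) for row, bi in zip(A, b)]


trajectory = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
A, b = accumulate_A_b(trajectory, gamma=1.0, lam=1.0)
print(A, b)
print(gradient(A, b, [0.0, 0.0]))   # gradient at omega = 0
print(gradient(A, b, [3.0, 2.0]))   # vanishes at the solution of A omega = b
```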

In the following, let $\delta_\omega$ denote the additive term of any update
of $\omega$ in the algorithms.



3 TD algorithms 

TD($\lambda$) [1], in its purely iterative form, performs the following update
after each transition: $\omega \leftarrow \omega + \alpha d_t z_t$.
Equivalently, the updates can be performed only after each trajectory, which
is more consistent with its definition. Depending on one's view (related to
the backward/forward views discussed in [10]), the first scheme can be
considered as the natural one and the second as accumulating successive
updates before committing them at the end, or the second one can be seen as
more natural (given the explanation in the previous section) and the first one
as a partial update given the partial computation of $\mu$. Note that here,
$\mu$ only concerns the current trajectory: the updates performed in
TD($\lambda$) only take into account the last trajectory.

Let us take a neutral point of view and state that the algorithm considers
the gradient on the current trajectory and updates the weights at any chosen
time (but necessarily including the end of the trajectory) by
$\omega \leftarrow \omega + \alpha\mu$ followed by $\mu \leftarrow 0$: $\mu$
is computed iteratively, and each time a partial computation has been used, it
is "thrown away". At the end of each trajectory, the associated gradient has
been used for one update $\omega \leftarrow \omega + \alpha\mu$ and is then
forgotten.

To summarize, given the fixed-point gradient function
$\mu(\text{observed transitions}, \omega)$, TD($\lambda$) updates $\mu$ after
each transition (as described in the previous section), and, whenever wanted,
performs a parameter update $\omega \leftarrow \omega + \alpha\mu$ followed by
$\mu \leftarrow 0$.

The residual-gradient TD algorithm [2] is actually the same algorithm, only
using the Bellman-residual gradient.
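
A minimal sketch of this "any-time update" view of TD($\lambda$) is given
below for the fixed-point gradient. It assumes the accumulating trace
$z_t = \gamma\lambda z_{t-1} + \phi(s_t)$; the function names, the boolean
switch between the two update schedules and the toy trajectory are
hypothetical choices made for the example.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def td_lambda(trajectories, k, alpha, gamma, lam, update_every_transition=False):
    """TD(lambda) seen as: accumulate mu, then omega += alpha*mu and mu <- 0."""
    omega = [0.0] * k
    for trajectory in trajectories:
        mu = [0.0] * k          # mu only concerns the current trajectory
        z = [0.0] * k
        for phi, r, phi_next in trajectory:
            d = r - dot(phi, omega) + gamma * dot(phi_next, omega)
            z = [gamma * lam * zi + f for zi, f in zip(z, phi)]   # assumed trace
            mu = [m + d * zi for m, zi in zip(mu, z)]
            if update_every_transition:
                omega = [w + alpha * m for w, m in zip(omega, mu)]
                mu = [0.0] * k  # the partial gradient is used once, then dropped
        # An update at the end of the trajectory is always performed.
        omega = [w + alpha * m for w, m in zip(omega, mu)]
    return omega


trajectory = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
print(td_lambda([trajectory] * 60, k=2, alpha=0.5, gamma=1.0, lam=1.0))
```

On this deterministic toy trajectory, replaying it repeatedly drives $\omega$
towards the observed returns, one partial gradient step per trajectory.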



4 LSTD algorithms

It has been shown in [9] that $\omega$ converges in TD($\lambda$) to
$\omega^*$ such that $\mu(\omega^*) = b - A\omega^* = 0$. This led to the
LSTD($\lambda$) algorithm [3] which, given sampled trajectories, directly
computes $\omega^* = A^{-1} b$.

For various reasons, such as numerical stability, use in optimistic policy
iteration, the possible singularity of $A$, smooth processing time, or getting
a specific point of view on the algorithm, the computation can be performed
iteratively. The algorithm can then be described as follows:

• for each new transition, update $\mu$ as described in Section 2 and update
$A^{-1}$ (using the Sherman-Morrison formula),

• whenever wanted, reduce $\mu$ by updating
$\omega \leftarrow \omega + A^{-1}\mu$. $\omega$ is then the exact solution of
$\mu(\text{samples so far}, \omega) = 0$, and $\mu$ is updated to 0.

Again, the same algorithm can be applied using the Bellman-residual gradient.

[4] introduced a similar algorithm, namely Least Squares Policy Evaluation.
The difference [...] consequently updating
$\mu \leftarrow \mu - A\delta_\omega$.

5 Full-gradient algorithms

Three algorithms are presented in this section that all rely on the same idea:
reduce $\mu$ (again at any time) in a gradient descent way, but maintain its
"real" value: instead of zeroing it after each update, which corresponds to
forgetting each trajectory after only one gradient descent step on its
contribution to the overall gradient $\mu$, the residual of the gradient is
kept, and thus the following updates not only perform one gradient descent
step on the current trajectory, but also continue this process for the
previous ones.

The first natural algorithm is introduced here as Full-gradient TD and
consists in replacing $\mu \leftarrow 0$ by
$\mu \leftarrow \mu - A\delta_\omega$ in the TD algorithm.

The iLSTD algorithm was introduced in [5], [6] (as well as the notation
$\mu$). Although it is presented as a variation of LSTD (hence its name), it
is more closely related to gradient descent than to the exact least-squares
solving scheme. With the "any-time update" generalization used throughout this
article, it can be described as a full-gradient TD in which $\omega$ is
updated only on its most correlated component:
$\omega_i \leftarrow \omega_i + \alpha\mu_i$, with
$i = \arg\max_j |\mu_j|$.

Finally, the equi-gradient descent (EGD) TD, introduced here, consists in
taking EGD [8] steps as an update scheme. In a few words, EGD also consists in
modifying only the most correlated parameter $\omega_i$, but $\alpha$ is
chosen such that after this update, another parameter $\omega_j$ becomes
equi-correlated. The next update is

$$\begin{pmatrix} \omega_i \\ \omega_j \end{pmatrix} \leftarrow
\begin{pmatrix} \omega_i \\ \omega_j \end{pmatrix} + \alpha_2
\begin{pmatrix} A_{ii} & A_{ij} \\ A_{ji} & A_{jj} \end{pmatrix}^{-1}
\begin{pmatrix} \mu_i \\ \mu_j \end{pmatrix}$$

and so on. The constraint is that, to allow the exact computation of the step
lengths, $\mu$ must not be modified (by new samples) in between those steps.
So a typical update schedule is to perform a certain number of steps at the
end of each trajectory, rather than one or a few steps after each transition.
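
The full-gradient and iLSTD updates of this section can be sketched as
follows, assuming the convention $\mu = b - A\omega$ so that subtracting
$A\delta_\omega$ keeps $\mu$ equal to the gradient at the new parameters. The
function names and the small matrix are hypothetical, and the matrix $A$ is
stored densely for simplicity.

```python
def full_gradient_step(omega, mu, A, alpha):
    """omega <- omega + alpha*mu, then mu <- mu - A*delta (the residual is kept)."""
    delta = [alpha * m for m in mu]
    new_omega = [w + d for w, d in zip(omega, delta)]
    new_mu = [m - sum(a * d for a, d in zip(row, delta)) for m, row in zip(mu, A)]
    return new_omega, new_mu


def ilstd_step(omega, mu, A, alpha):
    """Same idea, but only the component with the largest |mu_i| is updated."""
    i = max(range(len(mu)), key=lambda j: abs(mu[j]))
    delta_i = alpha * mu[i]
    new_omega = list(omega)
    new_omega[i] += delta_i
    # Only column i of A is needed, since delta is zero elsewhere.
    new_mu = [m - row[i] * delta_i for m, row in zip(mu, A)]
    return new_omega, new_mu


A = [[2.0, 0.0], [0.0, 1.0]]     # hypothetical accumulated matrix
b = [4.0, 1.0]
omega, mu = [0.0, 0.0], list(b)  # at omega = 0 the gradient equals b
print(full_gradient_step(omega, mu, A, alpha=0.25))
print(ilstd_step(omega, mu, A, alpha=0.25))
```

In both cases the returned gradient still satisfies $\mu = b - A\omega$ for
the returned parameters, which is what distinguishes these schemes from the
reset $\mu \leftarrow 0$ of TD.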

The benefit described in the first paragraph of this section comes at the cost
of maintaining the matrix $A$, which has the same order of complexity as
maintaining $A^{-1}$ in LSTD, but is still about half as costly. However, as
shown in [5], if the features are sparse (states have a non-zero value only on
a subset of the features), the complexity of the last two algorithms can be
lowered, unlike in LSTD.
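
For comparison, the rank-one update of $A^{-1}$ used by iterative LSTD can be
sketched with the Sherman-Morrison formula,
$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}u\,v^TA^{-1}}{1 + v^TA^{-1}u}$. In
the LSTD setting one would take $u = z_t$ and
$v = \phi(s_t) - \gamma\phi(s_{t+1})$, which is an assumption consistent with
the usual form of the algorithm rather than something stated in the text. The
function name is hypothetical.

```python
def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(k^2) operations."""
    k = len(u)
    Au = [sum(A_inv[i][j] * u[j] for j in range(k)) for i in range(k)]   # A^{-1} u
    vA = [sum(v[i] * A_inv[i][j] for i in range(k)) for j in range(k)]   # v^T A^{-1}
    denom = 1.0 + sum(v[i] * Au[i] for i in range(k))
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("the rank-one update makes the matrix singular")
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(k)]
            for i in range(k)]


identity = [[1.0, 0.0], [0.0, 1.0]]
# Inverse of [[2, 1], [0, 1]], obtained as identity + u v^T.
print(sherman_morrison(identity, u=[1.0, 0.0], v=[1.0, 1.0]))
```

The dense update touches every entry of $A^{-1}$ whatever the sparsity of the
features, whereas the update of $A$ itself only touches the entries where both
$u$ and $v$ are non-zero.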

EGD TD presents the crucial benefit of not having to tune the step-size
parameter $\alpha$ of gradient descent schemes. Instead of setting the lengths
of descent steps a priori and uniformly, and cross-validating them, they are
computed on the fly from the data.
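
To make the idea of a data-driven step length concrete, the sketch below
computes only the first step: it moves the most correlated parameter
$\omega_i$ until some other component $\mu_j$ reaches the same absolute value
as $\mu_i$. It is derived from the verbal description above, not taken from
[8]; it assumes $\mu = b - A\omega$, $A_{ii} > 0$, and no tie at the start, and
all function names are hypothetical.

```python
def egd_first_step(mu, A):
    """Return (i, j, delta): moving omega_i by delta makes |mu_j| equal to |mu_i|.

    Assumes A[i][i] > 0. If no other component catches up, j is None and the
    step is the one that would zero mu_i.
    """
    k = len(mu)
    i = max(range(k), key=lambda n: abs(mu[n]))
    s = 1.0 if mu[i] >= 0 else -1.0
    top = abs(mu[i])
    best_a, best_j = top / A[i][i], None
    for j in range(k):
        if j == i:
            continue
        # Along the step a >= 0: |mu_i| = top - a*A_ii and mu_j = mu_j - a*s*A_ji.
        # Solve mu_j(a) = +|mu_i(a)| and mu_j(a) = -|mu_i(a)| for a.
        candidates = (
            (top - mu[j], A[i][i] - s * A[j][i]),
            (top + mu[j], A[i][i] + s * A[j][i]),
        )
        for num, den in candidates:
            if den > 0 and 0 < num / den < best_a:
                best_a, best_j = num / den, j
    return i, best_j, s * best_a


def apply_step(omega, mu, A, i, delta):
    """omega_i += delta, and mu <- mu - delta * (column i of A)."""
    new_omega = list(omega)
    new_omega[i] += delta
    new_mu = [m - row[i] * delta for m, row in zip(mu, A)]
    return new_omega, new_mu


A = [[2.0, 1.0], [1.0, 2.0]]      # hypothetical matrix
mu = [3.0, -1.0]                  # hypothetical gradient
i, j, delta = egd_first_step(mu, A)
print(i, j, delta)
print(apply_step([0.0, 0.0], mu, A, i, delta))
```

After this step the two components of the gradient have equal magnitude, which
is the starting point of the two-parameter update described earlier.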



6 Experiments 

Experiments were run on a 100-state Boyan chain MDP [3]. Details are given
in [7]. The fixed-point gradient was used, with $\lambda = 0.5$. The following
are plotted:

• in Fig. 1, the RMSE against the number of trajectories, which illustrates
the differences between full exploitation of the samples (least-squares and
full-gradient methods) and TD,

• in Fig. 2, the RMSE against the computational time, where the three families
are clearly clustered. Note that the sparsity of the features has not been
taken into account, and EGD TD and iLSTD can perform much better on that
point, as shown experimentally in [5] for the latter.

Fig. 1: Root mean squared error against the number of trajectories

Fig. 2: Root mean squared error against the computational time

7 Summary and perspectives 

The classical algorithms of reinforcement learning have been presented here in
a view that is both practical and enlightening. This view allows a natural
introduction of a new intermediate family of algorithms that performs a
stochastic reduction of the errors, as in TD, but makes full use of the
samples, as in LSTD. Aside from time or sample complexity, these methods open
interesting perspectives in the context of optimistic policy iteration.
Indeed, the principle of neither forgetting samples after a small update, nor
taking them fully into account at once, may allow better use of samples than
TD while avoiding the issue met by LSTD in that context: giving too much
weight to samples coming from previous policies. This can be achieved by
scaling $\mu$ by a discount factor after each trajectory (for example), which
amounts to reducing only a given ratio of it.